Abstract 


The goal of Denali is to safely execute many inde- 
pendent, untrusted server applications on a single phys- 
ical machine. This would enable any developer to inject 
a new service into third-party Internet infrastructure; 
for example, dynamic content generation code could 
be introduced into content-delivery networks or caching 
systems. We believe that virtual machine monitors 
(VMMs) are ideally suited to this application domain. 
A VMM provides strong isolation by default, since one 
virtual machine cannot directly name a resource in an- 
other. In addition, VMMs defer the implementation of 
high-level abstractions to guest OSs, which greatly sim- 
plifies the kernel and avoids “layer-below” attacks. The 
main challenge in using a VMM for this application do- 
main is in scaling the number of concurrent virtual ma- 
chines that can simultaneously execute on it. 

The distinction between Denali and existing VMMs is 
that we make aggressive use of para-virtualization tech- 
niques. Para-virtualization entails selectively modifying 
the virtual architecture to enhance scalability, perfor- 
mance, and simplicity. By using para-virtualization, we 
believe Denali will be able to scale up to an order-of- 
magnitude more virtual machines than existing VMMs. 
We have implemented a prototype virtual machine mon- 
itor that runs in ring 0 on bare x86 hardware. In addi- 
tion, we have built a simple guest OS tailored to writing 
Internet services. 


1 Introduction 


Improvements in networking and computing 
technology are pushing application functionality 
into the wide-area infrastructure. This computing 
model has many advantages: services are immedi- 
ately available to clients without cumbersome soft- 
ware distribution, services are always available and 
can be accessed from any device, services can be 
administered centrally, and administration or main- 
tenance can be out-sourced to an infrastructure ser- 
vice provider rather than handled in-house. 

Many of today’s services are maintained by large 
organizations, such as Hotmail. However, the ben- 
efits of infrastructure computing should apply just 
as well to small services. A popular vision that we 
share is that any individual should be able to inject 
a new service into the Internet infrastructure for a 
small fee. As an example, a group of game play- 
ers could deploy a server to a well-connected point 
in the Internet for the duration of a multi-player 
game session. As another example, the owners of 
a web service that includes dynamically generated 
content could inject both static and dynamic por- 
tions of their site into a content-delivery network. 

These scenarios have significant trust implica- 
tions: infrastructure providers cannot trust con- 
sumers’ services, and services generally do not trust 
each other. Correspondingly, a mechanism must ex- 
ist to enforce strong isolation between services and 
the infrastructure, both in the security sense (pre- 
venting one service from corrupting another) and in 
the performance sense (fairly multiplexing physical 
resources such as CPU, memory, and network band- 
width). The simplest approach to providing this 
isolation would be to run each service on its own 
physical machine. In addition to isolating services 
from each other, this would also allow each service 
to choose its own operating system and software. 
However, dedicating physical machines to services 
is wasteful, as it eliminates the possibility of statis- 
tically multiplexing a machine across many services. 
It is also not cost-effective, as we believe there will 
be many services that neither require nor can afford 
the cost of an entire physical machine. 


1.1 Statistically multiplexing services 


The benefits of statistically multiplexing services 
are reinforced by Zipf’s law, which states that the 
frequency of an event is proportional to $x^{-\alpha}$, where 
$x$ is the rank of the event compared with all other 
events. Many studies of web servers, documents, 
web caches, and other network services have shown 
that popularity is almost always driven by Zipfian 
distributions [7]. Based on this, we expect that 
the popularity distribution of infrastructure services 
will also be driven by Zipf’s law. 
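
To make the shape of such a distribution concrete, the following Python sketch computes the cumulative share of requests under a Zipfian popularity law. It is an illustration only, not part of Denali; the function name `zipf_cdf` is hypothetical, and the parameters (10,000 services, exponent 0.75) are borrowed from the Figure 1 caption.

```python
def zipf_cdf(num_services, alpha):
    """Cumulative share of requests captured by the top-k services, k = 1..n."""
    # The weight of the service at a given rank is rank^(-alpha).
    weights = [rank ** -alpha for rank in range(1, num_services + 1)]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    return cdf

cdf = zipf_cdf(10_000, 0.75)
top_1_percent = cdf[99]          # share of requests to the 100 most popular services
bottom_half = 1.0 - cdf[4_999]   # share of requests to the 5,000 least popular services
print(f"top 1%: {top_1_percent:.2f}, bottom 50%: {bottom_half:.2f}")
```

Under these assumptions, a small set of popular services absorbs a large share of requests, while the long tail of unpopular services still receives a non-trivial fraction.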

Figure 1: Zipfian service popularity distribution: 
This figure shows the CDF of requests to 10,000 hypothetical 
services driven by a Zipfian probability distribution, 
with $\alpha = 0.75$. 

Zipfian distributions have two significant implications 
(Figure 1). First, most requests go to a 
small number of popular services. Second, most 
services are relatively unpopular, but a non-trivial 
fraction of requests go to these unpopular services. 
Because the amount of resources that a service re- 
quires is typically proportional to the workload it 
supports, popular services will require significant 
computational and networking resources. In con- 
trast, there will be a large number of services that 
require scarcely any resources, motivating the desire 
to multiplex many of them on a single computer for 
reasons of affordability and manageability. 

Fortunately, Moore’s law has resulted in commodity 
components with enormous processing 
power, storage, and network bandwidth. A single 
modern computer can support a large amount of 
service traffic: recent SPECweb results show that 
single CPU servers can serve 2,000 HTTP requests 
per second, or 172 million requests per day. Corre- 
spondingly, we believe that if isolation can be en- 
forced without introducing prohibitive overhead, a 
single computer can host a large number of con- 
current services (hundreds, or perhaps thousands) 
while supporting an aggregate throughput that is 
comparable to a single-service computer. 


1.2 Denali: supporting lightweight protection domains 


The Denali project seeks to implement 
lightweight protection domains that allow many 
untrusted services to execute inside the network 
infrastructure. In particular, Denali’s protection 
domains must have the following properties: 


• Strong isolation: arbitrary code executing in 
a protection domain is prevented from perturbing 
code executing in another domain, both in 
terms of security and performance. 

• Scales to many protection domains: in our 
application domain, we will need to execute 
hundreds or thousands of protection domains 
simultaneously on a single physical computer. 

• Rapid swapping: to support a workload consisting 
of many requests to unpopular services, 
the act of swapping in a service that hasn’t executed 
in a long time must be fast. 


We also believe services will be relatively inde- 
pendent, and therefore sharing across protection do- 
mains is infrequent. Thus, a mechanism that ob- 
tains these three properties at the cost of increased 
sharing overhead is acceptable. 

In this paper, we argue that virtual machine mon- 
itors (VMMs) are one of the few practical mecha- 
nisms that provide strong enough isolation for our 
desired application domains. Our research agenda 
includes mechanisms and design techniques to en- 
hance the ability of a VMM to scale up in the 
number of concurrently executing virtual machines 
(VMs). This paper represents an initial exploration 
of this agenda, in which we use a prototype im- 
plementation of a lightweight VMM system (de- 
scribed in Section 3) to explore design possibilities. 
Section 4 provides performance measurements from 
several micro- and macro-benchmarks. We discuss 
future work in Section 5, related work in Section 6, 
and then conclude. 


2 An Argument for VMMs 


A virtual machine monitor is a software layer that 
virtualizes all of the resources of a physical machine, 
thereby defining and supporting the execution of 
multiple virtual machines (VMs) [13, 28, 29]. The 
interface exported by a VMM is a virtualized hard- 
ware/software interface, including a CPU, physi- 
cal memory, and I/O devices. A VMM typically 
executes directly on physical hardware, and more 
specifically, below the level of operating systems. 
Within each VM, a “guest” operating system pro- 
vides the customary set of high-level abstractions 
such as files or network sockets (Figure 2). 

We believe VMMs are capable of providing strong 
isolation between virtual machines, both in the secu- 
rity and performance sense. However, for VMMs to 
work in our application domain, they must demon- 
strate adequate performance as the number of con- 
currently executing VMs scales up. In the remain- 
der of this section, we discuss the isolation prop- 
erties of VMMs and then introduce some issues of 
scale that arise when executing many VMs. 


2.1 Security isolation 


Figure 2: OSs compared with VMMs: (a) An OS 
shares and protects high-level abstractions built out of 
low-level physical resources. (b) With a VMM, protection 
is below abstraction. By exposing virtualized (protected) 
low-level resources, each VM can run its own OS 
to define high-level abstractions for its applications. 

One of our isolation objectives is to sandbox untrusted 
code to prevent services from directly reading 
or modifying the state of other services or of the 
underlying protection system.¹ The requirements 
of our application domain are fundamentally different 
from those that guided the design of protection 
in conventional desktop or time-sharing operating 
systems. In Denali, the unit of protection is a service 
instead of a user, and there is little need to 
share data between services (and hence protection 
domains). Virtual machine monitors are well-suited 
to this application domain because they directly address 
problems that plague conventional OSs: 

¹ We are not pursuing stronger properties such as monitoring 
information flow or eliminating covert channels. 

Simple, static sharing policy. VMMs impose 
a simple sharing policy: all data is private to a vir- 
tual machine unless it wishes to share that data over 
the network. The advantage of this approach is that 
it obviates the task of constructing an appropriate 
protection policy. The disadvantage is the increased 
cost of sharing data between applications. However, 
we believe this trade-off is justified, given that our 
application domain demands little sharing between 
applications. 

In contrast, the principals in a conventional OS 
are users that share data through protected abstrac- 
tions; this results in very complicated sharing poli- 
cies (e.g., “allow Jim to read file X” or “allow all 
of Sally’s programs to use the network”). The com- 
plexity of expressing an appropriate policy grows 
with the number of principals and protected ab- 
stractions. Even if an OS is flexible enough to allow 
all policies to be expressed, this complexity implies 
that, in practice, it is difficult to verify that a given 
policy behaves according to its author’s intentions. 
By removing protected sharing, VMMs avoid the 
issue of expressing complex policy. 

Protection is below abstractions. VMMs 
defer the implementation of high-level abstractions 
like file systems and network stacks to guest operat- 
ing systems. This greatly simplifies the implementa- 
tion of the VMM (which has positive security impli- 
cations), and it also eliminates “layer-below vulner- 
abilities” to which conventional operating systems 
are susceptible. 


In a conventional OS, policy is expressed in terms 
of high-level abstractions like files, instead of low- 
level resources like disk blocks. Unfortunately, ex- 
pressing protection policy in terms of abstractions 
gives rise to layer-below phenomena in which an at- 
tacker illicitly accesses resources by tunneling be- 
low the abstraction layer. For example, an attacker 
could read raw disk blocks to bypass the file system 
reference monitor, use a packet sniffer to capture the 
password of a local account, or force a core dump 
to access protected in-memory data. 

Private namespaces. With the exception of 
network addresses, all names exposed by a VMM 
are private to a VM. As a result, a VM cannot 
even construct a name that refers to the resource 
of another VM. Even if a VM is compromised by 
an attacker, it cannot access any other machine’s 
data, assuming that the VMM’s mapping from vir- 
tual to physical resources is implemented correctly. 
The only global (and hence shared) namespace that 
a VMM exposes is the set of MAC addresses on 
the virtual ethernet subnet. Security vulnerabilities 
that can be exploited over the network are beyond 
the scope of our project; however, we point out that 
any network-enabled application must be prepared 
to handle malicious traffic that arrives from the net- 
work, and that a VM that desires complete isolation 
need only drop all network traffic. 

In comparison, an operating system typically ex- 
poses several global namespaces (such as the set 
of all file names) through which users share data. 
These global namespaces can jeopardize security 
if they are misconfigured or poorly protected; for 
example, attackers could use aliases such as sym- 
bolic links to gain illicit access to resources. Global 
namespaces also grant unfettered access to an at- 
tacker that has gained supervisor privileges. 


2.2 Performance isolation 


Although the term “isolation” typically refers to 
security, an equally important aspect of service iso- 
lation is performance isolation (as evidenced by the 
rash of recent denial-of-service attacks). Our goal 
is to provide approximate resource fairness across 
services, even in the presence of malicious services 
or heavy network load. We do not aim to provide 
precise guarantees of the sort that are required by 
real-time applications. 

The need to support high-level abstractions pre- 
vents most OSs from providing strong performance 
isolation. High-level abstractions create contention 
points where applications compete for resources and 
synchronization primitives. This leads to the ef- 
fect of resource “cross-talk” [19], in which appli- 
cations’ resource management decisions interfere 
with each other. An additional problem posed by 
high-level abstractions is that precise resource 
accounting is difficult because resources are tied up 
in the implementation of the abstractions them- 
selves. For example, the file buffer-cache and 
TCP/IP socket buffers consume memory resources 
that aren’t “charged” to any particular application. 
Likewise, network protocol processing is often per- 
formed in the context of the running process instead 
of the receiving process, which can lead to unfair- 
ness and receiver live-lock [15]. 

By deferring the implementation of abstractions 
to guest OSs, VMMs need not suffer from these de- 
ficiencies. As we will show in Section 3, virtual 
hardware devices within the VMM act as queues 
for VMs’ resource accesses, making it possible for 
a VMM to implement policies such as fair queu- 
ing and stride scheduling. Because the VMM ex- 
poses hardware-level resources, there are fewer un- 
accounted resources than in conventional operating 
systems. 


2.3 Our challenge: scaling a VMM 


Our goal in Denali is to support a large number 
of protection domains efficiently. VMMs are known 
to introduce virtualization overhead, but as we con- 
firm in Section 4, the performance degradation from 
this overhead is modest on today’s machines. More 
importantly, there are many issues of scale that arise 
as we increase the number of concurrently execut- 
ing virtual machines. For example, at the archi- 
tectural level, as more VMs concurrently execute it 
becomes less likely that a given physical interrupt 
arrives when the VM it is associated with is run- 
ning. Issues of scale also affect operating system 
design; running hundreds of VMs implies executing 
hundreds of TCP/IP stacks on the same physical 
processor, which has implications for timer design. 

In the following section, we describe how we ex- 
ploit the notion of para-virtualization to address is- 
sues of scale. Para-virtualization exposes a virtual 
architecture that is slightly different than the physi- 
cal architecture. The differences in the architecture 
are driven by improvements in scalability or reduc- 
tions in system complexity. Modifying the architec- 
ture breaks backwards compatibility with existing 
OS code, which is a major disadvantage. However, 
it enables us to co-design the virtual architecture 
with the operating system, which gives us consider- 
able latitude when exploring issues of scale. 

Para-virtualization has been used in previous 
VMMs, including VM/370 [29] and Disco [9]. These 
systems added a combination of instructions, regis- 
ters, or devices to the virtual architecture to im- 
prove performance. However, because the goal of 
these systems was to run legacy OSs, their use of 
para-virtualization was minimized. Our contribution 
is to explore architectural modifications without 
regard to backwards compatibility of OS code. 

A potential criticism of para-virtualization is that 
it blurs the line between a VMM and a conventional 
OS. While this is true, we chose the term “virtual 
machine monitor” because we find virtualized hard- 
ware to be a useful metaphor for implementing the 
isolating properties discussed above: no shared ab- 
stractions, a simple sharing model, and no global 
namespaces. We discuss the relationship between 
Denali and previous systems in Section 6. 


3 Design and Implementation 


In this section, we describe Denali’s para- 
virtualized architecture. In addition, we describe 
a prototype VMM and guest OS that utilize this 
architecture. 


3.1 Para-virtualization in Denali 


The Denali architecture is based on the x86 in- 
struction set; this allows most virtual instructions to 
execute directly on the physical processor. However, 
the Denali architecture differs from the underlying 
x86 architecture in a number of ways that improve 
scalability, reduce implementation complexity, and 
increase performance. 

We have introduced purely virtual instructions 
that have no counterpart in the physical architec- 
ture; these are conceptually similar to OS system 
calls, except that they are non-blocking and they 
operate at the architectural level instead of at the 
level of OS abstractions. We have modified exist- 
ing instructions’ semantics; although we cannot re- 
move instructions from the physical architecture, 
we classify certain rarely used instructions as dep- 
recated and having undefined semantics. We have 
added virtual registers as a lightweight mechanism 
for passing data between the VMM and its VMs; 
these registers are mapped to a well-known region 
of a VM’s address space. Our virtual I/O devices 
export a simplified architectural interface, designed 
in part to minimize VM/VMM boundary crossings. 
Finally, as we will describe, other architecture fea- 
tures are heavily modified (e.g., interrupt delivery) 
or eliminated (e.g., virtual memory). 


3.1.1 Para-virtualization for scalability 


A simple barrier to scaling up to hundreds of VMs 
is that an OS must execute an idle loop when it has 
no useful work to do. In a VMM system, these loops 
waste CPU cycles, degrading the performance of the 
system. Denali introduces an “idle” instruction that 
allows a VM to yield control of the processor. After 
invoking it, the VM remains unscheduled until 
a new virtual interrupt arrives for it. By invoking 
this instruction, the VM promotes higher CPU utilization 
and increases its own performance, as it is 
no longer charged for these cycles. 


The idle instruction is similar to the x86’s halt 
instruction, which puts a physical machine to sleep 
awaiting an interrupt. Denali’s idle instruction en- 
hances this functionality with a timeout parameter 
that allows a VM to bound its sleep time. This ef- 
fectively introduces a yield primitive that allows for 
fine-grained sharing of the processor, such as when 
a VM is waiting for a TCP timeout to expire. 


A second scalability obstacle relates to virtual in- 
terrupt dispatching; when a physical interrupt ar- 
rives, the VMM raises a virtual interrupt in the 
appropriate VM. As the number of VMs grows, it 
becomes increasingly unlikely that the physical in- 
terrupt is destined for the currently running VM: 
handling physical interrupts destined for an inactive 
VM is the common case. One possible policy is to 
context switch to the destination VM immediately 
upon physical interrupt arrival. This synchronous 
dispatch model preserves timely delivery of inter- 
rupts, but unfortunately incurs the large cost of two 
context switches, which can result in context-switch 
thrashing as the number of VMs grows. Additionally, 
synchronous interrupt delivery fails to provide 
performance isolation in the presence of a denial of 
service attack. 


Instead, Denali exposes an asynchronous inter- 
rupt dispatch mechanism in which physical inter- 
rupts are queued until the target VM runs. Multiple 
interrupts destined for the same VM are batched, re- 
ducing VMM/VM boundary crossings and allowing 
the guest OS to handle virtual interrupts in an or- 
der of its own choosing. The importance of batching 
grows as the number of VMs (and hence the number 
of queued interrupts) increases. 
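
The queue-then-batch behavior described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Yakima's implementation: the class `InterruptQueues`, its method names, and the string identifiers for VMs and interrupts are all hypothetical.

```python
from collections import defaultdict, deque

class InterruptQueues:
    """Hypothetical per-VM queues of pending virtual interrupts."""

    def __init__(self):
        self.pending = defaultdict(deque)

    def on_physical_interrupt(self, vm_id, irq):
        # Queue the interrupt instead of context switching to the target VM.
        self.pending[vm_id].append(irq)

    def on_vm_scheduled(self, vm_id):
        # Hand over everything that accumulated while the VM was switched
        # out as one batch: one VMM/VM crossing covers many interrupts.
        batch = list(self.pending[vm_id])
        self.pending[vm_id].clear()
        return batch

queues = InterruptQueues()
queues.on_physical_interrupt("vm7", "nic_rx")
queues.on_physical_interrupt("vm3", "timer")
queues.on_physical_interrupt("vm7", "timer")
batch_vm7 = queues.on_vm_scheduled("vm7")   # both vm7 interrupts, in arrival order
```

Because the guest receives the whole batch at once, it is free to process the entries in whatever order it prefers.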


We have also modified the semantics of interrupts 
to improve scalability. On physical hardware, inter- 
rupts generally imply that something just happened. 
In Denali, a virtual interrupt implies that something 
happened in the recent past, possibly while you were 
context switched out. This semantic shift is particu- 
larly useful in the implementation of virtual timers. 
An OS typically maintains a “ticks” variable that is 
incremented on each hardware timer tick; mimick- 
ing this behavior on a VM requires raising a virtual 
interrupt for each timer tick that occurs while the 
VM isn’t running. Instead, Denali raises a “time has 
passed” virtual interrupt, and it exposes the number 
of physical timer ticks since system start in a vir- 
tual register. This eliminates additional VMM/VM 
crossings to determine elapsed physical time. 
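
A minimal Python model of this timer design follows. It assumes a single tick counter exposed as a virtual register and a single pending-interrupt flag; the names `VirtualTimer`, `ticks_since_boot`, and `Guest` are hypothetical and chosen only for illustration.

```python
class VirtualTimer:
    """Hypothetical model of the 'time has passed' interrupt and tick register."""

    def __init__(self):
        self.ticks_since_boot = 0       # stands in for the read-only virtual register
        self.interrupt_pending = False

    def physical_tick(self):
        # VMM side: every hardware tick updates the register, but at most one
        # virtual interrupt is left pending no matter how many ticks pass.
        self.ticks_since_boot += 1
        self.interrupt_pending = True

class Guest:
    def __init__(self, timer):
        self.timer = timer
        self.last_seen = timer.ticks_since_boot

    def handle_timer_interrupt(self):
        # Guest side: one handler invocation recovers all missed ticks.
        now = self.timer.ticks_since_boot
        elapsed = now - self.last_seen
        self.last_seen = now
        self.timer.interrupt_pending = False
        return elapsed

timer = VirtualTimer()
guest = Guest(timer)
for _ in range(250):                    # 250 hardware ticks while the VM is switched out
    timer.physical_tick()
elapsed = guest.handle_timer_interrupt()
```

In this model the guest handles one interrupt rather than 250, yet still learns exactly how much time has passed.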


3.1.2 Para-virtualization for simplicity 


Hardware architectures are complex. Precisely 
replicating a physical machine requires a VMM to 
emulate many hardware constructs: privileged ma- 
chine instructions, virtual memory, the BIOS, and 
I/O devices. However, some of these features are 
not necessary for our application domain. Para- 
virtualization provides an opportunity to remove or 
modify these features, vastly simplifying our VMM. 

An example of architectural complexity is the 
presence of non-virtualizable instructions in the x86 
instruction set [37]. These instructions behave dif- 
ferently in user mode and kernel mode; because vir- 
tual machines execute with the physical processor 
in user mode, this breaks backwards compatibility 
with legacy code. As a result, x86 VMMs such as 
VMWare and Plex86 [32] require elaborate binary- 
rewriting and virtual memory protection techniques 
to prevent these instructions from being directly ex- 
ecuted. Because we are not concerned with back- 
wards compatibility, we are content to deprecate 
these instructions.² Although the effects of these 
instructions are undefined, they are confined to the 
issuing VM. 

A more radical architectural simplification is that 
Denali does not expose virtual memory hardware. 
Denali’s virtual machines are constrained to use 
single address spaces, implying the use of library 
OSs similar to those in the Exokernel [17]. We be- 
lieve this change is warranted because Denali tar- 
gets small applications that do not require internal 
protection mechanisms. If multiple protection do- 
mains within an application are required, language 
techniques can be employed (as in Pilot [18]). 

We have eliminated other x86 components from 
our virtual architecture as well. The BIOS is used 
primarily to bootstrap a conventional OS and to 
determine system-specific parameters.³ We have replaced 
the BIOS bootstrap functionality by simply 
having the VMM load a VM’s image into memory, 
much like a process is loaded by an OS. System pa- 
rameters such as CPU speed and the size of (virtual) 
physical memory are accessible in read-only virtual 
registers. 

Our final set of architectural simplifications relates 
to I/O devices. Denali exports a small number of 
generic devices, rather than the large number of heterogeneous 
devices found on most systems; we currently 
support a network interface card, a serial device, 
a timer, and a generic keyboard/console. Device 
interfaces are streamlined to minimize the number 
of VMM/VM crossings. For example, transmitting 
any number of ethernet frames requires a single 
virtual I/O instruction. By contrast, existing physical 
NICs can require a dozen I/O instructions to 
implement the same functionality [39]. Additionally, 
Denali’s virtual devices do not require initialization 
during startup, simplifying guest OS device 
driver implementation and reducing OS boot time. 

² The most commonly used of these are pushl and popl, 
which are used to enable and disable interrupts. We replaced 
this functionality with a virtual register that serves as 
an interrupt-enabled flag. This also eliminates a VMM/VM 
crossing when virtual interrupts are enabled or disabled. 

³ The BIOS also contains power management functions; 
Denali does not expose power management to VMs. 

Figure 3: The Yakima virtual machine monitor: 
Arrows represent control and data flow. Some components 
(such as the virtual keyboard) are not shown. 


3.2 The Yakima VMM implementation 


Our prototype VMM implementation, called 
Yakima, runs in ring 0 on bare x86 hardware. 
Yakima is event-driven and non-blocking, and the 
only thread owned by the VMM is an idle thread. 
Currently, Yakima only runs on uniprocessors, but 
we have designed it to be extensible to SMPs. 

Figure 3 illustrates the major components of 
Yakima. At the lowest layer of the system are sup- 
port libraries from the Flux OSKit [22]. We use the 
OSKit as a hardware abstraction layer to simplify 
interactions with devices, page tables, interrupt vec- 
tors, and the BIOS. We also use a small portion of 
the OSKit’s libc library for dynamic memory management 
and interacting with the console. We do 
not use any of the OSKit’s larger libraries for pro- 
cesses, network stacks, and similar abstractions. 

The Yakima VMM multiplexes physical resources 
and exports the Denali architecture to each VM. At 
Yakima’s core is a simple round-robin CPU sched- 
uler coupled with a timer to prevent a VM from 
stealing the CPU; improving the scheduling policy 
is a topic for future work. Affiliated with the sched- 
uler is an idle table, which contains a list of idle 
machines (those that invoked the idle instruction 
described in Section 3.1.1). Yakima wakes up an 
idling VM when either a virtual interrupt arrives or 
its idle timeout value is exceeded. 
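
The interaction between the round-robin scheduler and the idle table can be sketched as follows. This is a simplified Python model under stated assumptions, not Yakima's code: the class `Scheduler`, its method names, and the use of integer timestamps are hypothetical.

```python
class Scheduler:
    """Hypothetical round-robin scheduler with an idle table."""

    def __init__(self, vm_ids):
        self.run_queue = list(vm_ids)   # runnable VMs, in round-robin order
        self.idle_table = {}            # vm_id -> absolute wake-up time

    def idle(self, vm_id, now, timeout):
        # Models the idle instruction: unschedule the VM until an interrupt
        # arrives or the timeout elapses.
        self.run_queue.remove(vm_id)
        self.idle_table[vm_id] = now + timeout

    def raise_interrupt(self, vm_id):
        # A virtual interrupt wakes an idling VM immediately.
        if vm_id in self.idle_table:
            del self.idle_table[vm_id]
            self.run_queue.append(vm_id)

    def pick_next(self, now):
        # Wake VMs whose idle timeout has expired, then rotate the run queue.
        for vm_id, deadline in list(self.idle_table.items()):
            if deadline <= now:
                del self.idle_table[vm_id]
                self.run_queue.append(vm_id)
        if not self.run_queue:
            return None                 # nothing runnable: the VMM's idle thread runs
        vm_id = self.run_queue.pop(0)
        self.run_queue.append(vm_id)
        return vm_id

sched = Scheduler(["a", "b", "c"])
sched.idle("b", now=0, timeout=10)
order = [sched.pick_next(now=t) for t in range(1, 5)]   # "b" is skipped while idle
sched.raise_interrupt("b")                              # "b" becomes runnable again
```

While a VM sits in the idle table it consumes no scheduling slots, which is the property that lets many mostly idle VMs share one processor.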

The instruction emulator implements instructions 
that cannot be directly executed by a VM, 
including I/O instructions and the halt instruction 
that terminates a VM. The instruction emulator 
also implements virtual instructions that have no 
counterpart in the x86 architecture. Denali’s virtual 
instructions are mapped to illegal opcodes, which 
Yakima traps and emulates. At the moment, our 
only virtual instruction is the idle instruction. 

Yakima emulates an ethernet subnet; each ma- 
chine has its own MAC address and is outwardly 
indistinguishable from a physical machine. Yakima 
maintains receive and transmit FIFOs on behalf of 
each VM; these emulate the FIFOs that exist on real 
NICs. At a lower layer, a packet scheduler and a 
virtual ethernet switch perform network multiplex- 
ing and demultiplexing, respectively. Currently, the 
packet scheduler uses a simple round-robin schedul- 
ing policy. We plan to explore more sophisticated 
fair queuing policies in the future. 
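
As a rough illustration of these two components, the Python sketch below demultiplexes incoming frames by destination MAC address and multiplexes outgoing frames with a round-robin policy. It is a toy model, not Yakima's implementation; `VirtualSwitch`, `round_robin_transmit`, and the example addresses and frame labels are hypothetical.

```python
from collections import deque

class VirtualSwitch:
    """Hypothetical demultiplexer: routes incoming frames to per-VM receive FIFOs."""

    def __init__(self):
        self.rx_fifos = {}              # MAC address -> receive FIFO

    def attach(self, mac):
        self.rx_fifos[mac] = deque()

    def deliver(self, dst_mac, frame):
        fifo = self.rx_fifos.get(dst_mac)
        if fifo is None:
            return False                # unknown destination: drop the frame
        fifo.append(frame)
        return True

def round_robin_transmit(tx_fifos):
    """Multiplex per-VM transmit FIFOs onto one link, one frame per VM per round."""
    wire = []
    while any(tx_fifos.values()):
        for vm, fifo in tx_fifos.items():
            if fifo:
                wire.append((vm, fifo.popleft()))
    return wire

switch = VirtualSwitch()
switch.attach("02:00:00:00:00:01")
delivered = switch.deliver("02:00:00:00:00:01", "frame-1")
dropped = switch.deliver("02:00:00:00:00:99", "frame-2")

tx = {"vm1": deque(["a1", "a2", "a3"]), "vm2": deque(["b1"])}
wire = round_robin_transmit(tx)
```

Note that this policy counts frames rather than bytes, so a VM sending large frames still receives more bandwidth than one sending small frames.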

Yakima’s approach to memory management is to 
statically allocate physical pages for each active vir- 
tual machine. Although static allocation is ineffi- 
cient, it is simple to implement and avoids worst- 
case thrashing behavior. To date, static memory 
allocation has proven to be reasonable for our ap- 
plication domain: our web server VM requires only 
12 megabytes of memory, which allows for over 80 
concurrently active VMs on a physical machine with 
1 gigabyte of physical memory. 
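
The capacity figure follows from simple division, as the short Python calculation below shows. It assumes binary units and ignores the memory consumed by the VMM itself, so it gives an upper bound rather than a measured result.

```python
MIB = 1024 * 1024
physical_memory = 1024 * MIB     # 1 gigabyte of physical memory
per_vm_memory = 12 * MIB         # static allocation for the web server VM

# Upper bound on concurrently resident VMs, ignoring the VMM's own footprint.
max_vms = physical_memory // per_vm_memory
print(max_vms)
```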

The protection of each virtual machine’s physical 
address space works in much the same fashion as a 
conventional OS. Yakima maintains page tables for 
each virtual machine; the address space visible to a 
VM contains a VM-accessible region, and a second 
region which is only accessible to the VMM.⁴ 

Yakima also includes support for a supervisor vir- 
tual machine, which is responsible for bootstrap- 
ping the VMM. The supervisor VM has access to 
privileged VMM calls to create and destroy other 
VMs. Currently, any user with physical access to 
the machine can issue supervisor calls. If more so- 
phisticated security policies are desired, it would be 
straightforward to replace or enhance the supervisor 
VM with additional functionality. 


⁴ While this appears to violate the namespace isolation 
advocated in Section 2, we argue that isolation is not affected 
because protection is still based on statically mapped page 
table entries. Whether these entries reside only in the VMM 
address space or are mapped into each VM but protected from 
VM access is irrelevant. Exposing the VMM’s address space 
in each VM eliminates TLB flushes on VMM/VM crossings, 
vastly reducing virtualization overhead. 


3.3 Ilwaco: an example guest OS 


We have developed a simple guest OS, named 
Ilwaco, which provides high-level abstractions to applications 
and shields applications from the details 
of the Denali virtual architecture. Among the abstractions 
provided by Ilwaco are a TCP/IP stack, 
a threads package, a subset of the libc library, and 
the BSD sockets interface. 

The Ilwaco TCP/IP stack is a port of the Alpine 
user-level TCP/IP infrastructure [16]. Alpine con- 
sists of the FreeBSD 3.3 stack and a support library 
that emulates the BSD kernel environment. We 
modified the support library to use Denali’s inter- 
rupt and timer models, and linked the stack against 
a device driver for the Denali virtual NIC. 

Ilwaco contains a threads package that includes 
basic primitives such as fork, kill, locks, and condi- 
tion variables. Ilwaco threads are non-preemptive; 
this simplified development since the BSD TCP/IP 
stack assumes a non-preemptive thread environ- 
ment. If a thread performs a timed sleep operation, 
the thread scheduler adds the sleep duration to a 
priority queue. If there are no runnable threads, 
the scheduler passes the smallest sleep duration to 
the Denali virtual idle instruction. 
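
The sleep-queue logic described above might look like the following Python sketch. It is an assumption-laden model rather than Ilwaco's code: the class `ThreadScheduler` is hypothetical, the idle instruction is represented by a callback, and the heap stores absolute wake-up times rather than raw durations so that the remaining sleep time can be computed at any moment.

```python
import heapq

class ThreadScheduler:
    """Hypothetical non-preemptive scheduler with a sleep priority queue."""

    def __init__(self, idle_instruction):
        self.runnable = []              # threads ready to run, in FIFO order
        self.sleepers = []              # min-heap of (wake_time, thread)
        self.idle_instruction = idle_instruction

    def sleep(self, thread, now, duration):
        heapq.heappush(self.sleepers, (now + duration, thread))

    def schedule(self, now):
        # Move threads whose sleep has expired back to the runnable list.
        while self.sleepers and self.sleepers[0][0] <= now:
            _, thread = heapq.heappop(self.sleepers)
            self.runnable.append(thread)
        if self.runnable:
            return self.runnable.pop(0)
        if self.sleepers:
            # Nothing to run: yield the CPU for the smallest remaining sleep time.
            self.idle_instruction(self.sleepers[0][0] - now)
        return None

idle_calls = []                         # records timeouts passed to the idle instruction
sched = ThreadScheduler(idle_calls.append)
sched.sleep("tcp_timer", now=0, duration=200)
sched.sleep("logger", now=0, duration=500)
first = sched.schedule(now=50)          # no runnable thread, so the VM idles
second = sched.schedule(now=200)        # the earliest sleeper has woken up
```

A real scheduler would also idle without a timeout when no thread is sleeping; that case is omitted here for brevity.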

Finally, Ilwaco’s supported subset of libc enables 
basic console I/O via printf and scanf, string ma- 
nipulation, random number generation, and mem- 
ory management. The majority of these functions 
were ported from OSKit libraries. Some functions 
were modified to interact with Denali’s virtual hard- 
ware; for example, malloc reads the size of (virtual) 
physical memory from a virtual register. 


3.4 Work in progress 


Denali is a work in progress, and there are several 
pieces of functionality that have yet to be imple- 
mented. Our highest priority is the implementation 
of stable storage functionality. Despite the lack of 
disk, our system supports a non-trivial web server 
VM, as we will describe in the next section. 

The resource management policies in our pro- 
totype VMM are overly simplistic. Both the 
packet and CPU schedulers use simple round-robin 
scheduling policies. Neither scheduler accounts for 
the amount of resources used during a round-robin 
iteration—the packet size for network traffic and 
the quantum for CPU scheduling. Although these 
schemes are sufficient to prevent starvation, they 
are not suitable for enforcing robust performance 
isolation. We are working to incorporate exist- 
ing scheduling algorithms like fair queuing [14] and 
stride scheduling [41]. 
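
To illustrate how a proportional-share policy differs from plain round-robin, the following is a small sketch of stride scheduling. It is not taken from the Yakima VMM; the function name, the ticket values, and the `STRIDE1` constant are assumptions chosen for the example.

```python
STRIDE1 = 1 << 20  # large constant; each client's stride is STRIDE1 / tickets


def stride_schedule(tickets, quanta):
    """Return the order in which clients receive `quanta` time slices.

    tickets: dict mapping a client name to its positive ticket count.
    """
    stride = {c: STRIDE1 // t for c, t in tickets.items()}
    passes = dict(stride)  # each client's pass value starts at its stride
    order = []
    for _ in range(quanta):
        # Run the client with the smallest pass; break ties by name.
        client = min(passes, key=lambda c: (passes[c], c))
        order.append(client)
        passes[client] += stride[client]
    return order
```

For example, `stride_schedule({"A": 3, "B": 1}, 8)` gives client A six of the eight quanta and client B two, matching the 3:1 ticket ratio, whereas round-robin would give each client four.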

Our VMM can only execute as many VMs as 
will fit in physical memory: we have not yet im- 
plemented the swapping of an idle VM to disk. In 
addition, supporting a large number of inactive vir- 
tual machines will require changes to the guest OS. 
For example, the TCP stack registers timers that 
fire every 200 milliseconds and every 500 millisec- 
onds. If left unmodified, this would force an inac- 
tive virtual machine to be swapped in 5 times per 
second, regardless of whether there are any pending 
connections. There are likely to be many such ele- 
ments of a conventional OS that must be modified 
to improve the scalability of our system. 


4 Measurements 


In this section, we describe a set of micro- 
benchmarks and application level benchmarks de- 
signed to show the performance of the prototype 
Yakima VMM and of applications executing on our 
example guest operating system. 

For all of the experiments described below, we 
ran our VMM and VMs on a 1700MHz Pentium 
4 with 256KB of L2 cache, 1GB of RAM, and In- 
tel PRO/1000 PCI gigabit ethernet cards connected 
with Intel 470T ethernet switches. Within Yakima, 
we ported version 3.0.7 of Intel’s PRO/1000 device
driver to the Flux OSKit Linux driver glue 
substrate. We used 1500 byte MTUs for our 
ethernet packets. To generate workloads for our 
server benchmarks, we used a mixture of several 
1700MHz Pentium 4 and 930MHz Pentium III ma- 
chines, thereby ensuring that the workload genera- 
tion clients were not the bottleneck of the system. 


4.1 Micro-benchmarks 


Our first set of measurements attempts to characterize
the performance of the Yakima VMM, independent
of application-level behavior. 


4.1.1 Context switch time 


We measured the time to context switch between 
virtual machines. To quantify the effect of cache 
pollution, we considered two workloads: the “worst- 
case test” cycles through a large memory buffer be- 
tween each context switch; the “best-case test” does 
not touch memory between context switches. Pre- 
emption was disabled for these tests. 

Figure 4 describes the context switch time as a 
function of the number of virtual machines. For the 
worst-case workload, context switch time starts at 
3.9 microseconds for a single virtual machine, and 
increases to over 9 microseconds for multiple VMs. 
For the best-case workload, the context switch time 
starts at 1.4 microseconds for a single virtual ma- 
chine and exhibits small peaks as we exhaust the 
capacity of the L1 and L2 caches. 

Taken as a whole, these numbers suggest that De- 
nali’s context switch time is manageable. Even the 
worst-case time of 9 microseconds is small relative 
to the thousands of microseconds that are required 
for TCP/IP protocol processing (refer to Figure 6 
below). In addition, over 40% of the context switch 
time is devoted to simply entering and exiting the 
kernel, which suggests that a conventional OS would 
demonstrate similar performance. 

Figure 4: Yakima context switch time: The worst-case
context switch time, with a memory-intensive workload,
tops out at 9.4 microseconds. The best-case context
switch time, whose workload does not touch memory,
tops out at 3.2 microseconds. 


4.1.2 Control flow from VM to VMM 


VM to VMM control flow transfers can happen 
in two situations. First, the supervisor VM can in- 
voke a privileged system call into the VMM by ex- 
ecuting the int instruction. Second, the VMM will 
trap and emulate privileged instructions, which hap- 
pens when a VM issues programmed I/O using the 
inb/outb instructions. The end result of both is 
the same: control is vectored to a kernel address 
specified in the x86 interrupt descriptor table.⁵ 

We measured the transfer time from VMs using 
a null system call and a generic programmed I/O 
instruction. The null system call is slightly cheaper 
than a programmed I/O instruction: 1759 cycles 
for the system call versus 2129 cycles for the PIO. 
In retrospect, using the int instruction for all con- 
trol transfers, instead of using inb/outb for PIOs, 
would provide slightly better performance. Fortu- 
nately, the performance difference is not noticeable 
in practice. 
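
For a sense of scale, the cycle counts above can be converted to wall-clock time. The short calculation below assumes the measurements were taken on the 1700 MHz Pentium 4 described in the experimental setup; the helper name is illustrative.

```python
CPU_HZ = 1_700_000_000  # 1700 MHz Pentium 4 from the experimental setup


def cycles_to_us(cycles, hz=CPU_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / hz * 1e6


null_syscall_us = cycles_to_us(1759)  # about 1.03 microseconds
pio_us = cycles_to_us(2129)           # about 1.25 microseconds
extra_cycles = 2129 - 1759            # 370 additional cycles for the PIO path
```

Under that assumption, both transfers cost on the order of one microsecond, and the PIO path adds 370 cycles, or roughly 0.2 microseconds.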


4.1.3 Packet processing overhead 


Figure 5 shows the cost of packet processing for 
application-level UDP packet transmission and re- 
ception, for both 100 and 1400 byte packets. 

A transmitted packet first traverses the TCP 
stack and then is processed by the guest OS device 
driver. This driver signals the virtual NIC using a 
PIO, resulting in a trap into the VMM. Inside the 
VMM, the virtual NIC implementation copies the 
packet out of the guest OS into a Tx FIFO. Once 
the VMM has decided to transmit the packet, the 
physical device driver is invoked. 

5 PIO instructions raise a general protection fault when 
executed in user mode. 

Packet reception essentially follows the same 
path in reverse. When the physical NIC receives 
a packet, it raises an interrupt, causing a device 
driver to execute. The driver hands the packet to 
the VMM, which then demultiplexes it into an ap- 
propriate virtual NIC Rx FIFO. When the virtual 
NIC is ready to hand the packet to its VM, the 
VMM copies the packet into the guest OS, and raises 
a virtual interrupt. The guest OS’s device driver 
processes the packet and gives it to the TCP stack, 
eventually resulting in the packet being handed to 
the application. 
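
The receive-side demultiplexing step can be sketched as below. The text does not say which header field the VMM uses to select a virtual NIC or how large each FIFO is, so the address key, the `capacity` bound, the drop-on-full behavior, and all class names here are assumptions made for illustration.

```python
from collections import deque


class VirtualNic:
    """Hypothetical per-VM virtual NIC with a bounded receive FIFO."""

    def __init__(self, capacity=64):
        self.rx_fifo = deque()
        self.capacity = capacity  # assumed bound; not given in the text
        self.dropped = 0

    def enqueue(self, packet):
        if len(self.rx_fifo) >= self.capacity:
            self.dropped += 1     # FIFO full: drop rather than block the VMM
            return False
        self.rx_fifo.append(packet)
        return True


class Demultiplexer:
    """Routes packets from the physical driver to per-VM Rx FIFOs."""

    def __init__(self):
        self.nics = {}  # destination address -> VirtualNic

    def attach(self, address, nic):
        self.nics[address] = nic

    def receive(self, dst_address, payload):
        """Called by the physical driver; True if a virtual NIC queued it."""
        nic = self.nics.get(dst_address)
        if nic is None:
            return False          # no VM owns this address
        return nic.enqueue(payload)

    def deliver(self, address):
        """Hand the oldest queued packet to the guest, or None if empty."""
        nic = self.nics[address]
        return nic.rx_fifo.popleft() if nic.rx_fifo else None
```

Keeping one FIFO per VM means a burst of traffic for one VM can only fill that VM's queue, which is one simple way to keep receive-side load from spilling over onto other VMs.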

As indicated in Figure 5, the physical device 
driver and TCP stack incur significantly more cost 
than the VMM itself. Handling a received packet 
in the physical device driver represents 43.3% and 
38.4% of the total packet processing costs for small 
and large packets, respectively. A non-trivial por- 
tion of this cost is due to the Flux OSKit’s interac- 
tion with the 8259A PIC; we plan on modifying the 
OSKit to use the more efficient APIC. The TCP 
stack represents 37.3% and 41.8% of a small and 
large received packet’s processing time, respectively. 
Of course, it is possible to optimize the stack within 
the guest OS to reduce overhead. 

The transmit path currently incurs two packet 
copies and one VM/VMM boundary crossing; it 
may be possible to eliminate one or both of these 
copies using virtual memory copy-on-write tech- 
niques. The receive path incurs the cost of a packet 
copy, an mbuf deallocation within the Flux OSKit, 
and a VMM/VM crossing. The mbuf deallocation 
attempts to coalesce the mbuf memory back into a 
global pool, and is therefore fairly costly. With ad- 
ditional optimization, we believe we can lower this 
cost as well. 


4.2 Application-level benchmarks 


The second set of measurements that we gath- 
ered were end-to-end measurements of network ap- 
plications running on top of our Ilwaco guest OS. 
These measurements show two things. First, the 
absolute performance numbers we obtain are com- 
parable with those of a conventional operating sys- 
tem (Linux), and therefore that the overhead of vir- 
tualization is not prohibitive. Second, performance 
is upheld as we scale up the number of virtual ma- 
chines running on a single system. These scaling 
numbers are only a start: as future work, we plan 
on exploring scaling issues in detail as we scale up 
far beyond the roughly 10 VMs that we present in 
this paper. 

6 While the device driver appears to be less expensive on 
the transmit path, the cost of interacting with the NIC is 
not included in these numbers, since this interaction is
asynchronous and largely driven by the NIC itself. 

Figure 5: Packet processing overhead: These two timelines illustrate the cost (in cycles) of processing a packet, 
broken down across various functional stages, for both packet reception and packet transmission. Each pair of 
numbers represents the number of cycles executed in that stage for 100 byte and 1400 byte packets, respectively. 


4.2.1 TCP/IP performance 


To measure TCP/IP throughput and latency, we 
wrote a simple application that opens a TCP con- 
nection to a remote server and sends data as quickly 
as possible. The remote server calculates aggregate 
TCP/IP throughput across all measured VMs. To 
measure end-to-end latency, we ping the supervisor 
machine while the throughput test is in progress. 


The results of these tests are shown in Figure 6. 
TCP/IP throughput reaches a peak value of 560 
Mb/s for 2 virtual machines.⁷ The low value for a 
single VM is due in part to TCP dynamics: because 
our TCP/IP stack runs within a VM, the system 
must cross the VMM/VM boundary before sending 
data in response to an ACK. This overhead cannot 
be optimized without pushing knowledge of TCP 
into the VMM. Fortunately, these effects are masked 
with multiple VMs because the VMM maintains an 
outbound transmit FIFO for each virtual machine; 
this means that the VMM can overlap packet trans- 
mission from one VM with computation within an- 
other. The bandwidth drop for more than 2 VMs is 
likely due to cache contention: the size of the TCP 
socket buffers is exactly half the L2 cache. 

When only the supervisor VM was running, the 
baseline ping time was 212 microseconds. Each ad- 
ditional TCP-intensive virtual machine caused the 
ping time to increase by roughly 3 milliseconds. 
This suggests that latency may prove to be more 
problematic than bandwidth when running a large 
number of concurrent VMs, and that additional 
techniques for reducing each VM’s compute time 
may be required. Eliminating unnecessary memory 
copies should prove beneficial in this respect. 
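
The latency trend can be written as a simple linear estimate. The sketch below uses only the two figures reported above (a 212 microsecond baseline and roughly 3 milliseconds per additional TCP-intensive VM); treating the trend as exactly linear, and extending it beyond the measured range, is an assumption.

```python
BASELINE_PING_US = 212     # supervisor VM alone
PER_VM_PENALTY_US = 3000   # roughly 3 ms per additional TCP-intensive VM


def estimated_ping_us(tcp_vms):
    """Linear estimate of ping time with `tcp_vms` TCP-intensive VMs running."""
    return BASELINE_PING_US + PER_VM_PENALTY_US * tcp_vms
```

Under this model, `estimated_ping_us(8)` is 24,212 microseconds, or about 24 ms. If the trend held at 100 VMs, the estimate would be about 0.3 seconds, which is why latency rather than bandwidth looks like the harder scaling problem.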





7 We did not explicitly consider fairness between VMs in 
this test. Informally, we observed a low variation of
bandwidth across VMs. 

Figure 6: TCP throughput and latency vs. # VMs:
The top graph shows the average TCP throughput
for a set of n virtual machines. Throughput reaches 
a peak value of 560 Mb/s for 2 VMs. The bottom graph 
shows end-to-end latency, which increases by roughly 3 
milliseconds per virtual machine. 


4.2.2 Web server performance 


Our final set of benchmarks test the performance 
of a web server running on top of Ilwaco. We 
implemented our own simple multi-threaded server 
that dispatches incoming requests to a pre-allocated 
thread pool. The web server serves static content; 
because we do not yet have a file system in Ilwaco, 
we used a Perl script to embed a document tree in 
a Unix file system into a set of C source files that 
was statically compiled into the application. 
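
The embedding step can be approximated with a short generator script. The original tool was a Perl script whose output format is not described, so the version below is only a sketch in Python: the `doc_N` array names, the `struct doc` layout, and the `doc_table` lookup table are hypothetical.

```python
import os


def c_array_literal(data):
    """Render bytes as comma-separated hex literals ("0" for empty input)."""
    return ", ".join(f"0x{b:02x}" for b in data) or "0"


def embed_tree(root):
    """Return C source defining one byte array per file under `root`
    and a table mapping URL paths to those arrays."""
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirs, files in os.walk(root)
        for name in files
    )
    arrays, rows = [], []
    for index, path in enumerate(paths):
        with open(path, "rb") as f:
            data = f.read()
        # URL path relative to the document root, with forward slashes.
        # Assumes file names contain no quotes or backslashes.
        url = "/" + os.path.relpath(path, root).replace(os.sep, "/")
        arrays.append(
            f"static const unsigned char doc_{index}[] = "
            f"{{ {c_array_literal(data)} }};\n"
        )
        rows.append(f'    {{ "{url}", doc_{index}, {len(data)} }},\n')
    return (
        "#include <stddef.h>\n\n"
        + "".join(arrays)
        + "\nstruct doc { const char *path; const unsigned char *data; size_t len; };\n"
        + "const struct doc doc_table[] = {\n"
        + "".join(rows)
        + "};\n"
    )
```

A server built this way can answer a request by scanning `doc_table` for the requested path and writing the matching array to the socket, with no file system involved.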

We used the httperf [34] tool running on Linux 
2.4.7 to generate workloads for our benchmarks. Us- 
ing this tool, we subjected the web server to two 
different workloads, varying the rate at which we 
generated requests and measuring the throughput, 
latency, and error rate of the server under these different
loads. We also varied the number of simultaneously
running web server VMs; the reported numbers
below represent the aggregate load summed up 
over all VMs. The small document workload consists
of repeated requests to a set of small HTML 
files averaging 2,079 bytes including HTTP headers. 
The large document workload consists of repeated 
requests to a single 134,128 byte PDF file. 

Figure 7: Web server performance: As a function of offered load and the number of concurrent VMs, these 
graphs illustrate (a) the throughput (in connections/s and megabits/s) of our web server, (b) the download latency 
of our web server, and (c) the error rate of our web server. The “small” document workload has an average document 
length of 2,079 bytes, and the “large” document workload has an average document length of 134,128 bytes. 


Figure 7 illustrates our results. For the small 
document workload, a single VM achieved a peak 
sustained response rate of 5,379 documents/second. 
This rate greatly surpasses the saturation point 
of most popular web servers (including Apache), 
but this is not a fair comparison, since Apache 
is more fully featured than our server. For the 
large document workload, a single VM achieved 
a peak sustained data throughput of 449.6 Mb/s, 
which is commensurate with the peak aggregate 
TCP throughput observed in Section 4.2.1. 


As we scaled up the number of VMs, we noticed 
that the aggregate bandwidth served by the entire 
system dropped, although we observed that band- 
width was fairly split across the VMs. For the small 
document workload, the aggregate throughput for 8 
VMs was 72.8 Mb/s, compared with 86.6 Mb/s for 
the single VM case. For the large document work- 
load, the 8 VM aggregate throughput was 324 Mb/s, 
compared with 427.7 Mb/s for the single VM case. 
We believe that this drop in performance is due to 
a combination of context switching overhead and 
perturbations in TCP dynamics because of context 
switching across machines; proving this hypothesis 
is the subject of future work. 

For comparison, we reran our benchmarks 
against the same web server running on Linux. We 
compiled the server against a Linux library imple- 
menting Ilwaco’s system call API. For the small doc- 
ument workload, we observed a peak response rate 
of 5,490 documents/second, and for the large docu- 
ment workload, we measured a sustained through- 
put of 507.9 Mb/s. The fact that Linux achieves 
only slightly higher TCP/IP throughput demon- 
strates that the overhead of virtualization intro- 
duced by Yakima is relatively small. 
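
The size of the gap can be made explicit with a short calculation over the peak figures reported in this section for a single VM and for Linux. The helper name is illustrative.

```python
def relative_shortfall(denali, linux):
    """Fraction by which the Denali figure falls below the Linux figure."""
    return (linux - denali) / linux


small_docs = relative_shortfall(5379, 5490)    # about 0.020, i.e. 2.0%
large_docs = relative_shortfall(449.6, 507.9)  # about 0.115, i.e. 11.5%
```

By these numbers, the single-VM server runs about 2% below Linux on the small document workload and about 11.5% below it on the large document workload.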


5 Future Directions 


As we discussed in Section 3.4, there are several 
short-term directions that we are pursuing, includ- 
ing adding disk support, swapping of idle VMs, the 
exploration of more sophisticated resource manage- 
ment algorithms, and exploring issues of scale in 
depth. Besides this, we have a longer-term research 
agenda. We believe that lightweight virtual ma- 
chines are a tool that enables many powerful, new 
applications that we plan to explore. 

Inserting services below the OS: VMMs pro- 
vide a layer of indirection between the guest OS and 
hardware; this facilitates the insertion of new sys- 
tem services transparently to the OS. Two services 
of interest are checkpointing and migration; VMs 
are uniquely suited to provide these because all OS 
and application state is accessible to a VMM. OS 
state such as file descriptors and socket buffers can 
be captured by the VMM in the memory and disk 
footprint of a VM. The only additional state to be 
captured is virtual hardware state, such as requests 
queued in virtual disk FIFOs. 

VMMs also provide the opportunity to hot-swap 
virtual devices transparently to a VM. For exam- 
ple, a generic virtual block device interface could be 
mapped to a conventional hard disk, a RAID array, 
or a distributed disk like Petal [33]. Other poten- 
tial services include NUMA memory management 
[9], fault tolerance [8], and secure logging [11]. 

Virtual machines for content distribution: 
A significant challenge for Internet services is dealing
with client load that can vary over several orders
of magnitude. The problem of flash-crowds on 
the web motivated replication mechanisms such as 
content delivery networks and proxy caches; unfor- 
tunately, these systems can only handle static data 
to date. According to a recent study by Wolman et 
al. [43], between 20-40% of web documents are dy- 
namically generated and are therefore not amenable 
to these systems. 

We propose using VMMs inside a CDN to gener- 
ate dynamic content at the edge of the network. In 
addition to providing improved availability through 
replication, CDNs can mitigate wide-area network 
failures, which are a significant cause of service out- 
ages [10]. VMMs enforce security and performance 
isolation, allowing service providers to provide guar- 
anteed service levels to their clients. Lightweight 
VMs would let a CDN host a large volume of dy- 
namic content, and VM migration would enable the 
demand-loading of active content into CDNs. 

Virtual clusters: VMMs introduce the pos- 
sibility of emulating many virtual clusters on top 
of a physical cluster of workstations [3]. In such 
a system, a service author would provide a collec- 
tion of processes that would execute on a collection 
of virtual cluster nodes. Cooperating VMMs exe- 
cuting on each physical node in the cluster would 
map each service’s virtual machines onto the physi- 
cal resources in the cluster. This would enable bet- 
ter multiplexing among a large set of services with 
bursty request streams. As the relative load on a 
virtual cluster increases, the system could increase 
that virtual cluster’s relative share of physical re- 
sources. We also anticipate having the VMMs in- 
teract with L2/L4 load balancing switches, in order 
to map externally visible IP addresses onto virtual 
clusters. 

The ability to migrate VMs can be used as a load- 
balancing mechanism in virtual clusters. If two pop- 
ular virtual clusters map to the same physical ma- 
chines, some of the virtual machines can be moved 
to take advantage of idle resources elsewhere. Also, 
migration would enable under-utilized machines to 
be switched off, which has positive repercussions for 
power conservation. 


6 Related Work 


6.1 Operating System improvements 


Many projects have sought to improve the OS 
as a reference monitor to isolate untrusted code. 
Privilege subsetting defines restricted rights for un- 
trusted code, distinct from normal user privileges 
[6, 12, 30]. Although this provides a mechanism 
to isolate untrusted code, the problem of express- 
ing appropriate policy is not specifically addressed. 
These proposals typically do not address layer-below 
attacks or vulnerabilities due to global namespaces. 

To address file system global namespace vulner- 
abilities, some OSs provide a chroot system call 
to contain a process to a subtree in the global file 
system. In theory, additional OS extensions could 
provide similar containment in other shared names- 
paces, such as PIDs and network addresses. How- 
ever, this would not eliminate layer-below vulnera- 
bilities as the OS would still retain the significant 
complexity associated with high-level abstractions. 
For example, it has been shown that it is possi- 
ble to use cached file descriptors to break out of 
a chroot’ed namespace. 

Several systems propose mechanisms to supple- 
ment the OS reference monitor. Janus [27] allows 
a user-level server to intercept system calls from 
untrusted processes. Software wrappers [24] pro- 
vide better performance by pushing the interception 
layer into the kernel. These systems provide mecha- 
nisms, but again do not address the problem of ap- 
propriate policy. Unfortunately, OSs have hundreds 
of system calls, implying policy is complex. Addi- 
tionally, these systems require knowledge of appli- 
cation behavior, while virtual machines do not. 

A promising approach to isolation is inferring ap- 
plication behavior using system call traces [1] and 
machine learning techniques [25]. However, this ap- 
proach confines applications to “typical” patterns of 
behavior and doesn’t respond well to non-malicious 
changes in application behavior. Virtual machines 
provide the ability to isolate untrusted code without 
knowledge of application semantics. 

WindowBox [4] seeks to isolate untrusted code to 
a virtual desktop inside Windows 2000. However, 
because it is implemented inside a conventional op- 
erating system, WindowBox’s security is limited by 
high-level abstractions and global namespaces. By 
virtualizing at a layer below operating system ab- 
stractions, VMMs are more secure. 

Fluke [23] proposes a recursive VM model, in 
which a parent can re-implement OS functionality 
on behalf of its child processes. In Denali, we vir- 
tualize at a layer below OS abstractions, whereas 
Fluke’s virtual architecture includes high-level IPC 
calls. By virtualizing at the level of hardware, we 
avoid imposing a fixed set of protection abstractions 
and nullify layer-below vulnerabilities. 

Server and multimedia systems have led to OS 
improvements for performance isolation. Resource 
containers [5] demonstrates that OS abstractions for 
resource management (processes and threads) are 
poorly suited to applications’ needs. Our work dif- 
fers in that resource management is below OS ab- 
stractions, which makes precise resource accounting 
more tractable and accurate. 

A limitation of VMs as resource principals is that 
they do not span multiple protection domains, mak- 
ing it difficult to have a common service (like a 
DNS resolver) shared by all VMs. The virtual ser- 
vices work [36] is notable for tracking resource usage 
across server boundaries. However, this work suffers 
from significant implementation complexity; it also 
relies on intercepting system calls, which is subject 
to the same caveats as the Janus work. 

Many proposals for fair resource allocation poli- 
cies exist: fair queueing for network bandwidth [14], 
stride scheduling for CPU allocation [41], and the 
Cello framework for disk bandwidth allocation [38]. 
This work is complementary to Denali; we plan on 
incorporating some of these policies in our VMM. 


6.2 Software virtual machines 


Software virtual architectures such as Java, Om- 
niWare [2], and the Microsoft Common Language 
Runtime have been proposed to isolate untrusted 
code. However, running multiple applications in 
a single VM has many of the same problems as 
running multiple applications on an OS. Libraries 
(e.g., Java’s class library) provide shared abstrac- 
tions that can be subverted through layer-below at- 
tacks. The trend toward extensible security archi- 
tectures [42] means that security policy must be ex- 
pressed in two places (the host OS and the soft- 
ware virtual machine). Finally, resource manage- 
ment within a single VM is complicated by the abil- 
ity to share resources through pointers. 

Alternatively, each application could be isolated 
in its own software VM, similar to a hardware VM 
architecture. Software VMs require complex soft- 
ware runtimes, which, in the case of Java, has led 
to numerous security vulnerabilities. We believe 
a VMM that is nearly identical to the underlying 
hardware is simpler to build and more robust.⁸ 





8We are intrigued by the possibility of using transparent 
instruction set mapping, as is done on the Transmeta Crusoe 
processor. 


6.3 Small kernel architectures 


VMMs have served as the foundation of several 
“security kernels” [26, 31, 35]. More recently, the 
NetTop initiative has sought to create secure virtual 
workstations running on VMWare [40]. Our work 
differs from these efforts in that we aim to provide 
scalability as well as isolation. Our work also as- 
sumes a weaker threat model: we are not concerned 
with covert channels between VMs. 

Denali is similar in many respects to microkernel 
operating systems [20]. Indeed, Denali’s virtual ma- 
chines could be viewed as single-threaded applica- 
tions on a low-level microkernel. However, the main 
focus of microkernel research is in pushing OS func- 
tionality into shared servers, which are themselves 
susceptible to the security vulnerabilities discussed 
in section 2. While it may be possible to remove 
shared servers from a microkernel system, our work 
addresses issues related to scaling to a large number
of untrusted applications. On the other hand, fast 
IPC is not a major focus of our research because 
data sharing is not a requirement of our application
domain. 

Exokernels [17, 21] eliminate high-level abstrac- 
tions to enable OS extensibility. Although this en- 
ables optimizations based on physical names, it dis- 
courages isolation because all resources exist inside a 
single, globally-visible namespace. This necessitates 
complex mechanisms to download protection policy 
based on high-level abstractions into low-level Ex- 
okernel protection mechanisms. Moreover, the vir- 
tual name spaces exposed by a VMM facilitate rapid 
swapping, since virtual-to-physical name bindings 
can be transparently modified by the VMM. Swap- 
ping applications is more difficult on an Exokernel 
system because there is no way to transparently 
remap library OS address translations. 

Nemesis [19] shares our goal of eliminating ap- 
plication resource “crosstalk”. Nemesis adopts a 
similar approach to us, pushing most kernel func- 
tionality (including protocol processing and device 
drivers) into application space. Our systems dif- 
fer in that Nemesis is not designed to sandbox un- 
trusted code; Nemesis applications share a global 
file system and a single virtual address space. 


6.4 Virtual hosting platforms 


Numerous commercial and open-source prod- 
ucts provide support for virtual hosting, including 
freeVSD, Apache virtual hosts, the Solaris resource 
manager, and Ensim’s ServerXchange. All work 
within a conventional operating system or applica- 
tion, and therefore cannot provide the same degree 
of isolation as a VMM. 


Two commercial VMMs provide virtual host- 
ing services: VMWare’s ESX server and IBM’s 
z/VM system. By allowing ourselves to change the 
virtual architecture and co-design the OS (para- 
virtualization), we believe that Denali will scale to 
many more VMs on similar hardware than these 
products. Our work also addresses resource man- 
agement in the face of a large number of concurrent 
VMs; we are not aware of any publications on this 
subject related to these products. 


7 Conclusions 


In this paper, we argued that virtual machine 
monitors are well-suited to the task of hosting many 
untrusted applications on a single physical machine. 
VMMs defer the implementation of high-level ab- 
stractions and sophisticated protection policies to 
guest operating systems. This makes the monitor 
simpler, and facilitates strong security and perfor- 
mance isolation, at the cost of increased sharing 
overhead. 

To scale up to a large number of virtual machines, 
Denali utilizes para-virtualization, which entails se- 
lectively modifying the virtual architecture. Us- 
ing para-virtualization techniques, we co-designed a 
VMM and a guest operating system capable of sup- 
porting a non-trivial web-server application, which 
can serve over 5,379 HTTP requests per second. 

Although Denali is a work-in-progress, our work 
thus far demonstrates that it is possible to achieve 
strong isolation and reasonable performance using 
a virtual machine monitor. We also demonstrated 
that some of the issues that impact scaling up the 
number of concurrent virtual machines demand a 
reconsideration of the virtualized architecture, and 
indeed, the co-design of the virtual architecture with 
guest operating systems. 